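Video prediction is an important yet challenging problem, burdened with the tasks of generating future frames and learning environment dynamics. Recently, autoregressive latent video models have proved to be a powerful video prediction tool by separating video prediction into two sub-problems: pre-training an image generator model, then learning an autoregressive prediction model in the latent space of the image generator. However, successfully generating high-fidelity and high-resolution videos has yet to be seen. In this work, we investigate how to train an autoregressive latent video prediction model capable of predicting high-fidelity future frames with minimal modification to existing models, and produce high-resolution (256x256) videos. Specifically, we scale up prior models by employing a high-fidelity image generator (VQ-GAN) with a causal transformer model, and introduce additional techniques of top-k sampling and data augmentation to further improve video prediction quality. Despite its simplicity, the proposed method achieves performance competitive with state-of-the-art approaches on standard video prediction benchmarks with fewer parameters, and enables high-resolution video prediction on complex and large-scale datasets. Videos are available at https://sites.google.com/view/harp-videos/home.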
translated by 谷歌翻译
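The top-k sampling mentioned in the video prediction abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `top_k_sample`, the toy logits, and the use of plain Python lists are assumptions made for illustration. The idea is to keep only the k highest-scoring tokens and sample from a softmax restricted to them.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample an index from the k highest-scoring entries of `logits`.

    Hypothetical helper for illustration; `rng` is any object that
    provides a `choices` method (the `random` module or a `random.Random`).
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    # Indices of the k largest logits; every other token is excluded.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits only (shifted by the max for stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


# Toy logits over five latent codes (made-up values).
logits = [2.0, 0.5, -1.0, 3.0, 0.0]
rng = random.Random(0)
samples = [top_k_sample(logits, k=2, rng=rng) for _ in range(200)]
```

With `k=2`, only indices 3 and 0 can ever be drawn, which is how top-k sampling keeps low-probability codes out of the generated frames.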
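Visual model-based reinforcement learning (RL) has the potential to enable sample-efficient robot learning from visual observations. Yet current approaches typically train a single model end-to-end to learn both visual representations and dynamics, making it difficult to accurately model the interaction between robots and small objects. In this work, we introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning. Specifically, we train an autoencoder with convolutional layers and vision transformers (ViT) to reconstruct pixels given masked convolutional features, and learn a latent dynamics model that operates on the representations from the autoencoder. Moreover, to encode task-relevant information, we introduce an auxiliary reward prediction objective for the autoencoder. We continually update both the autoencoder and the dynamics model using online samples collected from environment interaction. We demonstrate that our decoupling approach achieves state-of-the-art performance on a variety of visual robotic tasks from Meta-World and RLBench; e.g., we achieve an 81.7% success rate on 50 visual robotic manipulation tasks from Meta-World, while the baseline achieves 67.9%. Code is available on the project website: https://sites.google.com/view/mwm-rl.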
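Recent unsupervised pre-training methods have been shown to be effective in the language and vision domains by learning useful representations for multiple downstream tasks. In this paper, we investigate whether such unsupervised pre-training methods can also be effective for vision-based reinforcement learning (RL). To this end, we introduce a framework that learns representations useful for understanding dynamics via generative pre-training on videos. Our framework consists of two phases: we pre-train an action-free latent video prediction model, and then use the pre-trained representations to efficiently learn action-conditional world models on unseen environments. To incorporate additional action inputs during fine-tuning, we introduce a new architecture that stacks an action-conditional latent prediction model on top of the pre-trained action-free prediction model. Moreover, for better exploration, we propose a video-based intrinsic bonus that leverages the pre-trained representations. We demonstrate that our framework significantly improves both the final performance and the sample efficiency of vision-based RL on a variety of manipulation and locomotion tasks. Code is available at https://github.com/younggyoseo/apv.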
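Goal-conditioned hierarchical reinforcement learning (HRL) has shown promising results for solving complex and long-horizon RL tasks. However, the action space of the high-level policy in goal-conditioned HRL is often large, which results in poor exploration and leads to inefficient training. In this paper, we present HIerarchical reinforcement learning Guided by Landmarks (HIGL), a novel framework for training a high-level policy guided by landmarks, i.e., promising states to explore. The key component of HIGL is twofold: (a) sampling landmarks that are informative for exploration, and (b) encouraging the high-level policy to generate a subgoal toward a selected landmark. For (a), we consider two criteria: coverage of the entire visited state space (i.e., dispersion of states) and novelty of states (i.e., prediction error of a state). For (b), we select a landmark as the very first landmark on the shortest path in a graph whose nodes are landmarks. Our experiments show that our framework outperforms prior methods across a variety of control tasks, thanks to the efficient exploration guided by landmarks.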
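The two landmark steps described in the HIGL abstract above, coverage-based sampling for (a) and shortest-path selection for (b), can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's code: farthest point sampling stands in for the coverage criterion, Euclidean distance stands in for whatever reachability measure the method uses, the novelty criterion is omitted, and the names `farthest_point_sampling`, `first_landmark_on_shortest_path`, and `max_edge` are hypothetical.

```python
import heapq
import math


def farthest_point_sampling(states, n_landmarks):
    """Greedily pick states that are far apart (a coverage heuristic)."""
    chosen = [0]  # start from the first state (arbitrary choice)
    # Distance from every state to its nearest chosen landmark so far.
    nearest = [math.dist(s, states[0]) for s in states]
    while len(chosen) < min(n_landmarks, len(states)):
        idx = max(range(len(states)), key=lambda i: nearest[i])
        chosen.append(idx)
        nearest = [min(d, math.dist(s, states[idx])) for d, s in zip(nearest, states)]
    return [states[i] for i in chosen]


def first_landmark_on_shortest_path(start, goal, landmarks, max_edge):
    """Run Dijkstra over {start, landmarks, goal}; return the first landmark on the path.

    Returns None if the goal is unreachable or directly reachable from start.
    """
    nodes = [start] + list(landmarks) + [goal]
    goal_idx = len(nodes) - 1
    dist = {0: 0.0}
    parent = {}
    queue = [(0.0, 0)]
    while queue:
        d, u = heapq.heappop(queue)
        if u == goal_idx:
            break
        if d > dist.get(u, math.inf):
            continue  # stale queue entry
        for v in range(len(nodes)):
            w = math.dist(nodes[u], nodes[v])
            # Assumption: only nodes within `max_edge` of each other are connected.
            if v != u and w <= max_edge and d + w < dist.get(v, math.inf):
                dist[v] = d + w
                parent[v] = u
                heapq.heappush(queue, (d + w, v))
    if goal_idx not in parent:
        return None
    # Walk back from the goal to the node whose parent is the start.
    node = goal_idx
    while parent[node] != 0:
        node = parent[node]
    return None if node == goal_idx else nodes[node]


# Toy 2D states visited along a line (made-up data).
visited = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
landmarks = farthest_point_sampling(visited, 3)
subgoal_target = first_landmark_on_shortest_path(
    start=(0.5, 0.0), goal=(5.0, 0.0), landmarks=landmarks, max_edge=2.5
)
```

In this toy run the sampled landmarks are the two ends and the middle of the line, and the selected landmark is `(2.0, 0.0)`: it is the first hop on the shortest path toward the goal rather than the landmark nearest to the start.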
In this paper, we propose the first diffusion-based face swapping framework, called DiffFace, which consists of training an ID-conditional DDPM, sampling with facial guidance, and target-preserving blending. Specifically, in the training process, the ID-conditional DDPM is trained to generate face images with the desired identity. In the sampling process, we use off-the-shelf facial expert models to make the model transfer the source identity while faithfully preserving the target attributes. During this process, to preserve the background of the target image and obtain the desired face swapping result, we additionally propose a target-preserving blending strategy. It helps our model keep the attributes of the target face from noise while transferring the source facial identity. In addition, without any re-training, our model can flexibly apply additional facial guidance and adaptively control the ID-attributes trade-off to achieve the desired results. To the best of our knowledge, this is the first approach that applies a diffusion model to the face swapping task. Compared with previous GAN-based approaches, by taking advantage of the diffusion model for the face swapping task, DiffFace gains benefits such as training stability, high fidelity, sample diversity, and controllability. Extensive experiments show that DiffFace is comparable or superior to state-of-the-art methods on several standard face swapping benchmarks.
For change detection in remote sensing, constructing a training dataset for deep learning models is difficult because it requires bi-temporal supervision. To overcome this issue, single-temporal supervision, which treats change labels as the difference of two semantic masks, has been proposed. This method trains a change detector using two spatially unrelated images with corresponding semantic labels, such as buildings. However, training on unpaired datasets could confuse the change detector in the case of pixels that are labeled unchanged but are visually significantly different. To maintain visual similarity in unchanged areas, in this paper we emphasize that the change originates from the source image and show that manipulating the source image to serve as an after-image is crucial to the performance of change detection. Extensive experiments demonstrate the importance of maintaining visual information between pre- and post-event images, and our method outperforms existing methods based on single-temporal supervision. Code is available at https://github.com/seominseok0429/Self-Pair-for-Change-Detection.
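The single-temporal supervision idea in the abstract above, where the change label is the difference of two semantic masks, can be made concrete with a small sketch. It assumes binary masks and reads "difference" as a pixel-wise XOR; the function name `change_label` and the toy masks are hypothetical and not taken from the linked repository.

```python
def change_label(mask_a, mask_b):
    """Pixel-wise XOR of two binary semantic masks (1 = object, 0 = background).

    A pixel is labeled "changed" (1) when exactly one of the masks marks it.
    """
    if len(mask_a) != len(mask_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(mask_a, mask_b)
    ):
        raise ValueError("masks must have the same shape")
    return [
        [int(a != b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(mask_a, mask_b)
    ]


# Toy building masks for two images (made-up data).
before = [[1, 1, 0],
          [0, 0, 0]]
after = [[1, 0, 0],
         [0, 1, 0]]
label = change_label(before, after)  # [[0, 1, 0], [0, 1, 0]]
```

The top-left pixel is a building in both masks, so it is labeled unchanged. That is the case the abstract warns about when the two images are spatially unrelated: the label says "unchanged" even though the pixels may look very different.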
In recent years, generative models have undergone significant advancement due to the success of diffusion models. The success of these models is often attributed to their use of guidance techniques, such as classifier and classifier-free methods, which provide effective mechanisms to trade off between fidelity and diversity. However, these methods are not capable of guiding a generated image to be aware of its geometric configuration, e.g., depth, which hinders the application of diffusion models to areas that require a certain level of depth awareness. To address this limitation, we propose a novel guidance approach for diffusion models that uses estimated depth information derived from the rich intermediate representations of diffusion models. To do this, we first present a label-efficient depth estimation framework using the internal representations of diffusion models. At the sampling phase, we use two guidance techniques to self-condition the generated image on the estimated depth map: the first uses pseudo-labeling, and the second uses a depth-domain diffusion prior. Experiments and extensive ablation studies demonstrate the effectiveness of our method in guiding diffusion models toward geometrically plausible image generation. The project page is available at https://ku-cvlab.github.io/DAG/.
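For context on the fidelity/diversity trade-off mentioned in the abstract above, classifier-free guidance is commonly written as below. This is the standard background formulation, not the depth-aware guidance this abstract proposes; the symbols $\epsilon_\theta$ (noise prediction network), $x_t$ (noisy sample), $c$ (condition), and $w$ (guidance weight) are notation assumed here, since the abstract itself gives no formulas.

```latex
\tilde{\epsilon}_\theta(x_t, c) = (1 + w)\,\epsilon_\theta(x_t, c) - w\,\epsilon_\theta(x_t)
```

Setting $w = 0$ recovers the plain conditional model, while larger $w$ typically increases fidelity to the condition at the cost of sample diversity.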
Deep learning-based weather prediction models have advanced significantly in recent years. However, data-driven models based on deep learning are difficult to apply in real-world settings because they are vulnerable to spatial-temporal shifts. A weather prediction task is especially susceptible to spatial-temporal shifts when the model is overfitted to locality and seasonality. In this paper, we propose a training strategy to make weather prediction models robust to spatial-temporal shifts. We first analyze how the hyperparameters and augmentations of the existing training strategy affect the model's robustness to spatial-temporal shifts. Next, based on the analysis results, we propose an optimal combination of hyperparameters and augmentations, together with a test-time augmentation. We performed all experiments on the W4C22 Transfer dataset and achieved first-place performance.
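The abstract above does not say which test-time augmentation is used, so the following is only a generic sketch of the technique: predictions for the input and for its horizontally flipped copy are averaged after undoing the flip. The choice of a horizontal flip, the names `hflip` and `predict_with_tta`, and the toy model are all assumptions for illustration.

```python
def hflip(grid):
    """Mirror a 2D grid (list of rows) left to right."""
    return [row[::-1] for row in grid]


def predict_with_tta(model, grid):
    """Average the model's output on the input and on its flipped copy.

    `model` is any callable mapping a 2D grid to a 2D grid of the same shape.
    """
    plain = model(grid)
    # Predict on the flipped input, then flip the output back into place.
    unflipped = hflip(model(hflip(grid)))
    return [
        [(p + q) / 2 for p, q in zip(row_p, row_q)]
        for row_p, row_q in zip(plain, unflipped)
    ]


def toy_model(grid):
    """Stand-in model with a position-dependent bias (adds the column index)."""
    return [[value + col for col, value in enumerate(row)] for row in grid]


prediction = predict_with_tta(toy_model, [[1, 2, 3]])  # [[2.0, 3.0, 4.0]]
```

The toy model's bias grows with the column index. Averaging with the flipped pass replaces that slope with a constant offset, which illustrates why test-time augmentation can reduce sensitivity to spatial position.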
Traditional weather forecasting relies on domain expertise and computationally intensive numerical simulation systems. Recently, with the development of data-driven approaches, weather forecasting based on deep learning has been receiving attention. Deep learning-based weather forecasting has made stunning progress, from backbone studies using CNNs, RNNs, and Transformers to training strategies using weather observation datasets with auxiliary inputs. All of this progress has contributed to the field of weather forecasting; however, the many elements and complex structures of deep learning models prevent us from reaching physical interpretations. This paper proposes a SImple baseline with a spatiotemporal context Aggregation Network (SIANet) that achieves state-of-the-art results on 4 of the 5 W4C22 benchmarks. This simple but efficient structure uses only satellite images and CNNs in an end-to-end fashion, without a multi-model ensemble or fine-tuning. Owing to this simplicity, SIANet can serve as a solid baseline that is easy to apply to deep learning-based weather forecasting.
Traversability estimation for mobile robots in off-road environments requires more than the conventional semantic segmentation used in constrained environments such as on-road conditions. Recently, approaches that learn traversability estimation from past driving experiences in a self-supervised manner have been emerging, as they can significantly reduce human labeling costs and labeling errors. However, the self-supervised data only provide supervision for the regions actually traversed, inducing epistemic uncertainty due to the scarcity of negative information. Negative data are rarely harvested because the system can be severely damaged while logging the data. To mitigate the uncertainty, we introduce a deep metric learning-based method that incorporates unlabeled data with a few positive and negative prototypes, jointly learning semantic segmentation and traversability regression. To rigorously evaluate the proposed framework, we introduce a new evaluation metric that comprehensively evaluates both the segmentation and the regression. Additionally, we construct a driving dataset, 'Dtrail', in off-road environments with a mobile robot platform; it contains a wide variety of negative data. We evaluate our method on Dtrail as well as the publicly available SemanticKITTI dataset.